1.
Nucleic Acids Res ; 52(D1): D154-D163, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-37971293

ABSTRACT

We present a major update of the HOCOMOCO collection that provides DNA binding specificity patterns of 949 human transcription factors and 720 mouse orthologs. To make this release, we performed motif discovery in peak sets that originated from 14 183 ChIP-Seq experiments and reads from 2554 HT-SELEX experiments yielding more than 400 thousand candidate motifs. The candidate motifs were annotated according to their similarity to known motifs and the hierarchy of DNA-binding domains of the respective transcription factors. Next, the motifs underwent human expert curation to stratify distinct motif subtypes and remove non-informative patterns and common artifacts. Finally, the curated subset of 100 thousand motifs was supplied to the automated benchmarking to select the best-performing motifs for each transcription factor. The resulting HOCOMOCO v12 core collection contains 1443 verified position weight matrices, including distinct subtypes of DNA binding motifs for particular transcription factors. In addition to the core collection, HOCOMOCO v12 provides motif sets optimized for the recognition of binding sites in vivo and in vitro, and for annotation of regulatory sequence variants. HOCOMOCO is available at https://hocomoco12.autosome.org and https://hocomoco.autosome.org.


Subject(s)
Databases, Genetic , Gene Expression Regulation , Protein Interaction Domains and Motifs , Transcription Factors , Animals , Humans , Mice , Binding Sites/genetics , Nucleotide Motifs , Transcription Factors/genetics , Transcription Factors/metabolism , Internet , Protein Interaction Domains and Motifs/genetics
2.
Nucleic Acids Res ; 51(D1): D564-D570, 2023 Jan 06.
Article in English | MEDLINE | ID: mdl-36350659

ABSTRACT

We present an update of EpiFactors, a manually curated database providing information about epigenetic regulators, their complexes, targets, and products, which is openly accessible at http://epifactors.autosome.org. The updated version of EpiFactors contains information on 902 proteins, including 101 histones and protamines, and, as a main update, a newly curated collection of 124 lncRNAs involved in epigenetic regulation. The number of publications concerning the role of lncRNA in epigenetics is rapidly growing. Yet a resource that compiles, integrates, organizes, and presents curated information on lncRNAs in epigenetics has been missing. EpiFactors fills this gap and provides data on epigenetic regulators in an accessible and user-friendly form. For 820 of the genes in EpiFactors, we include expression estimates across multiple cell types assessed by CAGE-Seq in the FANTOM5 project. In addition, the updated EpiFactors contains information on 73 protein complexes involved in epigenetic regulation. Our resource is practical for a wide range of users, including biologists, bioinformaticians and molecular/systems biologists.


Subject(s)
Databases, Genetic , Epigenesis, Genetic , Humans , Histones/genetics , Histones/metabolism , Protamines , RNA, Long Noncoding/genetics , RNA, Long Noncoding/metabolism
3.
Nucleic Acids Res ; 50(W1): W51-W56, 2022 Jul 05.
Article in English | MEDLINE | ID: mdl-35446421

ABSTRACT

We present ANANASTRA, https://ananastra.autosome.org, a web server for the identification and annotation of regulatory single-nucleotide polymorphisms (SNPs) with allele-specific binding events. ANANASTRA accepts a list of dbSNP IDs or a VCF file and reports allele-specific binding (ASB) sites of particular transcription factors or in specific cell types, highlighting those with ASBs significantly enriched at SNPs in the query list. ANANASTRA is built on top of a systematic analysis of allelic imbalance in ChIP-Seq experiments and performs the ASB enrichment test against background sets of SNPs found in the same source experiments as ASB sites but not displaying significant allelic imbalance. We illustrate ANANASTRA usage with selected case studies and expect that ANANASTRA will help to conduct the follow-up of GWAS in terms of establishing functional hypotheses and designing experimental verification.


Subject(s)
Polymorphism, Single Nucleotide , Transcription Factors , Alleles , Binding Sites , Genome-Wide Association Study , Protein Binding , Transcription Factors/chemistry , Transcription Factors/metabolism , DNA-Binding Proteins
5.
Cell Rep ; 35(10): 109221, 2021 Jun 08.
Article in English | MEDLINE | ID: mdl-34107262

ABSTRACT

Somatic mutations in regulatory sites of human stem cells affect cell identity or cause malignant transformation. By mining the human genome for co-occurrence of mutations and transcription factor binding sites, we show that C/EBP binding sites are strongly enriched with [C > T]G mutations in cancer and adult stem cells, which is of special interest because C/EBPs regulate cell fate and differentiation. In vitro protein-DNA binding assay and structural modeling of the CEBPB-DNA complex show that the G·T mismatch in the core CG dinucleotide strongly enhances affinity of the binding site. We conclude that enhanced binding of C/EBPs shields CpG·TpG mismatches from DNA repair, leading to selective accumulation of [C > T]G mutations and consequent deterioration of the binding sites. This mechanism of targeted mutagenesis highlights the effect of a mutational process on certain regulatory sites and reveals the molecular basis of putative regulatory alterations in stem cells.


Subject(s)
Adult Stem Cells/metabolism , CCAAT-Enhancer-Binding Protein-alpha/metabolism , Dinucleoside Phosphates/metabolism , Neoplasms/genetics , Humans , Mutation
6.
Nat Commun ; 12(1): 2751, 2021 May 12.
Article in English | MEDLINE | ID: mdl-33980847

ABSTRACT

Sequence variants in gene regulatory regions alter gene expression and contribute to phenotypes of individual cells and the whole organism, including disease susceptibility and progression. Single-nucleotide variants in enhancers or promoters may affect gene transcription by altering transcription factor binding sites. Differential transcription factor binding in heterozygous genomic loci provides a natural source of information on such regulatory variants. We present a novel approach to call the allele-specific transcription factor binding events at single-nucleotide variants in ChIP-Seq data, taking into account the joint contribution of aneuploidy and local copy number variation, that is estimated directly from variant calls. We have conducted a meta-analysis of more than 7 thousand ChIP-Seq experiments and assembled the database of allele-specific binding events listing more than half a million entries at nearly 270 thousand single-nucleotide polymorphisms for several hundred human transcription factors and cell types. These polymorphisms are enriched for associations with phenotypes of medical relevance and often overlap eQTLs, making candidates for causality by linking variants with molecular mechanisms. Specifically, there is a special class of switching sites, where different transcription factors preferably bind alternative alleles, thus revealing allele-specific rewiring of molecular circuitry.


Subject(s)
Alleles , Genome, Human , Regulatory Sequences, Nucleic Acid/genetics , Transcription Factors/metabolism , Chromatin/metabolism , Databases, Genetic , Gene Dosage , Gene Expression Regulation/genetics , Genome-Wide Association Study , Humans , Nucleotide Motifs , Phenotype , Polymorphism, Single Nucleotide , Protein Binding , Quantitative Trait Loci
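
The allele-specific binding calls described in the abstract above rest on testing whether read counts at a heterozygous site deviate from the expected allelic ratio. A minimal Python sketch of that idea follows, assuming a plain two-sided exact binomial test. This is a simplification for illustration and not the statistical model used in the paper. The function names and the `expected_ref_fraction` parameter are hypothetical; the parameter stands in for a correction for aneuploidy and local copy number.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def allelic_imbalance_pvalue(ref_reads, alt_reads, expected_ref_fraction=0.5):
    """Two-sided exact binomial p-value for allelic imbalance at one site.

    expected_ref_fraction is the null share of reference-allele reads
    (0.5 for a balanced diploid locus; an assumed, hypothetical knob for
    loci with unequal allelic copy number).
    """
    n = ref_reads + alt_reads
    observed = binom_pmf(ref_reads, n, expected_ref_fraction)
    probs = [binom_pmf(k, n, expected_ref_fraction) for k in range(n + 1)]
    # Sum the probabilities of all outcomes at least as unlikely as the
    # observed one; the small tolerance guards against rounding noise.
    p_value = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)
```

A site with balanced counts, such as 5 reference and 5 alternative reads, gives a p-value of about 1, while 10 reads on one allele and none on the other gives 2/1024. In practice the p-values from many sites would also need multiple-testing correction, which this sketch omits.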
7.
Methods Mol Biol ; 2252: 269-294, 2021.
Article in English | MEDLINE | ID: mdl-33765281

ABSTRACT

During translation, the rate of ribosome movement along mRNA varies. This leads to a non-uniform ribosome distribution along the transcript, depending on local mRNA sequence, structure, tRNA availability, and translation factor abundance, as well as the relationship between the overall rates of initiation, elongation, and termination. Stress, antibiotics, and genetic perturbations affecting composition and properties of translation machinery can alter the ribosome positional distribution dramatically. Here, we offer a computational protocol for analyzing positional distribution profiles using ribosome profiling (Ribo-Seq) data. The protocol uses papolarity, a new Python toolkit for the analysis of transcript-level short read coverage profiles. For a single sample, papolarity computes for each transcript the classic polarity metric which, in the case of Ribo-Seq, reflects ribosome positional preferences. For comparison versus a control sample, papolarity estimates an improved metric, the relative linear regression slope of coverage along transcript length. This involves de-noising by profile segmentation with a Poisson model and aggregation of Ribo-Seq coverage within segments, thus achieving reliable estimates of the regression slope. The papolarity software and the associated protocol can be conveniently used for Ribo-Seq data analysis in the command-line Linux environment. The papolarity package is available through the Python pip package manager. The source code is available at https://github.com/autosome-ru/papolarity.


Subject(s)
Computational Biology/methods , RNA, Messenger/genetics , Ribosomes/metabolism , Animals , High-Throughput Nucleotide Sequencing , Humans , Linear Models , Poisson Distribution , Protein Biosynthesis , RNA, Messenger/metabolism , Sequence Analysis, RNA , Software
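
The classic polarity metric mentioned in the abstract above can be sketched in a few lines of Python. The sketch assumes the commonly used definition in which position i of a transcript of length l gets the weight (2i - (l + 1)) / (l - 1), so the score runs from -1 (all coverage at the 5' end) to +1 (all coverage at the 3' end). The function below is a hypothetical illustration and not part of the papolarity API.

```python
def polarity(coverage):
    """Polarity score of a per-nucleotide coverage profile.

    coverage: list of non-negative read counts, 5' to 3'.
    Returns a value in [-1, 1], or None when the score is undefined
    (fewer than two positions or no reads at all).
    """
    length = len(coverage)
    total = sum(coverage)
    if length < 2 or total == 0:
        return None
    # Normalized position weights: -1 at the first nucleotide, +1 at the last.
    weights = ((2 * i - (length + 1)) / (length - 1) for i in range(1, length + 1))
    return sum(c * w for c, w in zip(coverage, weights)) / total
```

A uniform profile such as `[1, 1, 1]` scores 0, while `[1, 0, 0]` scores -1 and `[0, 0, 4]` scores +1. The slope-based metric described in the abstract involves segmentation and regression and is not covered by this sketch.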
8.
BMC Genomics ; 21(1): 754, 2020 Nov 02.
Article in English | MEDLINE | ID: mdl-33138777

ABSTRACT

BACKGROUND: Efforts to elucidate the function of enhancers in vivo are underway but their vast numbers alongside differing enhancer architectures make it difficult to determine their impact on gene activity. By systematically annotating multiple mouse tissues with super- and typical-enhancers, we have explored their relationship with gene function and phenotype. RESULTS: Though super-enhancers drive high total- and tissue-specific expression of their associated genes, we find that typical-enhancers also contribute heavily to the tissue-specific expression landscape on account of their large numbers in the genome. Unexpectedly, we demonstrate that both enhancer types are preferentially associated with relevant 'tissue-type' phenotypes and exhibit no difference in phenotype effect size or pleiotropy. Modelling regulatory data alongside molecular data, we built a predictive model to infer gene-phenotype associations and use this model to predict potentially novel disease-associated genes. CONCLUSION: Overall our findings reveal that differing enhancer architectures have a similar impact on mammalian phenotypes whilst harbouring differing cellular and expression effects. Together, our results systematically characterise enhancers with predicted phenotypic traits endorsing the role for both types of enhancers in human disease and disorders.


Subject(s)
Enhancer Elements, Genetic , Animals , Enhancer Elements, Genetic/genetics , Humans , Mice , Phenotype
9.
Front Genet ; 10: 1078, 2019.
Article in English | MEDLINE | ID: mdl-31737053

ABSTRACT

Many problems of modern genetics and functional genomics require the assessment of functional effects of sequence variants, including gene expression changes. Machine learning is considered to be a promising approach for solving this task, but its practical applications remain a challenge due to the insufficient volume and diversity of training data. A promising source of valuable data is a saturation mutagenesis massively parallel reporter assay, which quantitatively measures changes in transcription activity caused by sequence variants. Here, we explore the computational predictions of the effects of individual single-nucleotide variants on gene transcription measured in the massively parallel reporter assays, based on the data from the recent "Regulation Saturation" Critical Assessment of Genome Interpretation challenge. We show that the estimated prediction quality strongly depends on the structure of the training and validation data. Particularly, training on the sequence segments located next to the validation data results in the "information leakage" caused by the local context. This information leakage allows reproducing the prediction quality of the best CAGI challenge submissions with a fairly simple machine learning approach, and even obtaining notably better-than-random predictions using irrelevant genomic regions. Validation scenarios preventing such information leakage dramatically reduce the measured prediction quality. The performance at independent regulatory regions entirely excluded from the training set appears to be much lower than needed for practical applications, and even the performance estimation will become reliable only in the future with richer data from multiple reporters. The source code and data are available at https://bitbucket.org/autosomeru_cagi2018/cagi2018_regsat and https://genomeinterpretation.org/content/expression-variants.
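
The leakage-free validation scenario described above amounts to holding out whole regulatory regions rather than individual variants. A minimal Python sketch follows, assuming each variant is a `(region, position, effect)` tuple. The function name, the tuple layout and the region names are hypothetical and do not come from the challenge data.

```python
def split_by_region(variants, held_out_regions):
    """Split variants so that no region contributes to both sets.

    variants: iterable of (region, position, effect) tuples.
    held_out_regions: names of the regions reserved for validation.
    """
    held_out = set(held_out_regions)
    train = [v for v in variants if v[0] not in held_out]
    test = [v for v in variants if v[0] in held_out]
    return train, test


# Hypothetical toy data: two variants in regionA, one each in regionB and regionC.
variants = [
    ("regionA", 10, 0.8),
    ("regionA", 11, -0.1),
    ("regionB", 5, 0.3),
    ("regionC", 7, 0.0),
]
train, test = split_by_region(variants, ["regionB"])
```

A random per-variant split would instead put neighbouring positions of the same region on both sides of the split, which is the situation the abstract identifies as the source of inflated prediction quality.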

10.
BMC Res Notes ; 11(1): 756, 2018 Oct 23.
Article in English | MEDLINE | ID: mdl-30352610

ABSTRACT

OBJECTIVES: Mammalian genomics studies, especially those focusing on transcriptional regulation, require information on genomic locations of regulatory regions, particularly, transcription factor (TF) binding sites. There are plenty of published ChIP-Seq data on in vivo binding of transcription factors in different cell types and conditions. However, handling of thousands of separate data sets is often impractical and it is desirable to have a single global map of genomic regions potentially bound by a particular TF in any of studied cell types and conditions. DATA DESCRIPTION: Here we report human and mouse cistromes, the maps of genomic regions that are routinely identified as TF binding sites, organized by TF. We provide cistromes for 349 mouse and 599 human TFs. Given a TF, its cistrome regions are supported by evidence from several ChIP-Seq experiments or several computational tools, and, as an optional filter, contain occurrences of sequence motifs recognized by the TF. Using the cistrome, we provide an annotation of TF binding sites in the vicinity of human and mouse transcription start sites. This information is useful for selecting potential gene targets of transcription factors and detecting co-regulated genes in differential gene expression data.


Subject(s)
Genome , Sequence Analysis, DNA , Transcription Factors , Animals , Binding Sites , Humans , Mice
11.
Nucleic Acids Res ; 46(D1): D252-D259, 2018 Jan 04.
Article in English | MEDLINE | ID: mdl-29140464

ABSTRACT

We present a major update of the HOCOMOCO collection that consists of patterns describing DNA binding specificities for human and mouse transcription factors. In this release, we profited from a nearly doubled volume of published in vivo experiments on transcription factor (TF) binding to expand the repertoire of binding models, replace low-quality models previously based on in vitro data only and cover more than a hundred TFs with previously unknown binding specificities. This was achieved by systematic motif discovery from more than five thousand ChIP-Seq experiments uniformly processed within the BioUML framework with several ChIP-Seq peak calling tools and aggregated in the GTRD database. HOCOMOCO v11 contains binding models for 453 mouse and 680 human transcription factors and includes 1302 mononucleotide and 576 dinucleotide position weight matrices, which describe primary binding preferences of each transcription factor and reliable alternative binding specificities. An interactive interface and bulk downloads are available on the web: http://hocomoco.autosome.ru and http://www.cbrc.kaust.edu.sa/hocomoco11. In this release, we complement HOCOMOCO by MoLoTool (Motif Location Toolbox, http://molotool.autosome.ru) that applies HOCOMOCO models for visualization of binding sites in short DNA sequences.


Subject(s)
Databases, Genetic , Transcription Factors/metabolism , Animals , Binding Sites/genetics , Chromatin Immunoprecipitation , Humans , Mice , Models, Genetic , Nucleotide Motifs , Sequence Analysis, DNA
12.
PLoS One ; 12(2): e0172681, 2017.
Article in English | MEDLINE | ID: mdl-28234966

ABSTRACT

We studied the functional effect of the rs12722489 single nucleotide polymorphism, located in the first intron of the human IL2RA gene, on transcriptional regulation. This polymorphism is associated with multiple autoimmune conditions (rheumatoid arthritis, multiple sclerosis, Crohn's disease, and ulcerative colitis). Analysis in silico suggested a significant difference in the affinity of the estrogen receptor (ER) binding site between alternative allelic variants, with stronger predicted affinity for the risk (G) allele. Electrophoretic mobility shift assay showed that purified human ERα bound only the G variant of a 32-bp genomic sequence containing rs12722489. Chromatin immunoprecipitation demonstrated that endogenous human ERα interacted with the rs12722489 genomic region in vivo, and DNA pull-down assay confirmed differential allelic binding of amplified 189-bp genomic fragments containing rs12722489 with endogenous human ERα. In a luciferase reporter assay, a kilobase-long genomic segment containing the G but not the A allele of rs12722489 demonstrated enhancer properties in the MT-2 cell line, an HTLV-1 transformed human cell line with a regulatory T cell phenotype.


Subject(s)
Estrogen Receptor alpha/genetics , Interleukin-2 Receptor alpha Subunit/genetics , Polymorphism, Single Nucleotide , Response Elements , T-Lymphocytes, Regulatory/metabolism , Alleles , Base Sequence , Binding Sites , Cell Line, Transformed , Chromatin Immunoprecipitation , Electrophoretic Mobility Shift Assay , Estrogen Receptor alpha/metabolism , Gene Expression Regulation , Genes, Reporter , Human T-lymphotropic virus 1/genetics , Humans , Interleukin-2 Receptor alpha Subunit/metabolism , Introns , Luciferases/genetics , Luciferases/metabolism , Protein Binding , T-Lymphocytes, Regulatory/cytology
13.
Gene ; 602: 50-56, 2017 Feb 20.
Article in English | MEDLINE | ID: mdl-27876533

ABSTRACT

The IL2RA gene encodes the alpha subunit of a high-affinity receptor for interleukin-2 which is expressed by several distinct populations of lymphocytes involved in autoimmune processes. A large number of polymorphic alleles of the IL2RA locus are associated with the development of various autoimmune diseases. With bioinformatics analysis we dissected the first intron of the IL2RA gene and selected several single nucleotide polymorphisms (SNPs) that may influence the regulation of the IL2RA gene in cell types relevant to autoimmune pathology. We described five enhancers containing the selected SNPs that stimulated activity of the IL2RA promoter in a cell-type specific manner, and tested the effect of specific SNP alleles on activity of the respective enhancers (E1 to E5, labeled according to the distance to the promoter). The E4 enhancer with the minor T variant of the rs61839660 SNP demonstrated reduced activity due to disrupted binding of MEF2A/C transcription factors (TFs). Neither rs706778 nor rs706779, both associated with a number of autoimmune diseases, had any effect on the activity of the enhancer E2. However, rare variants of several SNPs (rs139767239, rs115133228, rs12722502, rs12722635) genetically linked to rs706778 and/or rs706779 significantly influenced the activity of the E1, E3 and E5 enhancers, presumably by disrupting EBF1, GABPA and ELF1 binding sites.


Subject(s)
Interleukin-2 Receptor alpha Subunit/genetics , Autoimmune Diseases/genetics , Autoimmune Diseases/immunology , Cell Line , Enhancer Elements, Genetic , Genetic Predisposition to Disease , Humans , Introns , Jurkat Cells , Polymorphism, Single Nucleotide , Promoter Regions, Genetic , T-Lymphocytes, Helper-Inducer/immunology , T-Lymphocytes, Helper-Inducer/metabolism , T-Lymphocytes, Regulatory/immunology , T-Lymphocytes, Regulatory/metabolism , Transcription Factors/metabolism
14.
Biochim Biophys Acta ; 1859(10): 1259-68, 2016 Oct.
Article in English | MEDLINE | ID: mdl-27424222

ABSTRACT

Signaling lymphocytic activation molecule family member 1 (SLAMF1)/CD150 is a co-stimulatory receptor expressed on a variety of hematopoietic cells, in particular on mature lymphocytes activated by specific antigen, costimulation and cytokines. Changes in CD150 expression level have been reported in association with autoimmunity and with B-cell chronic lymphocytic leukemia. We characterized the core promoter for SLAMF1 gene in human B-cell lines and explored binding sites for a number of transcription factors involved in B cell differentiation and activation. Mutations of SP1, STAT6, IRF4, NF-kB, ELF1, TCF3, and SPI1/PU.1 sites resulted in significantly decreased promoter activity of varying magnitude, depending on the cell line tested. The most profound effect on the promoter strength was observed upon mutation of the binding site for Early B-cell factor 1 (EBF1). This mutation produced a 10-20 fold drop in promoter activity and pinpointed EBF1 as the master regulator of human SLAMF1 gene in B cells. We also identified three potent transcriptional enhancers in human SLAMF1 locus, each containing functional EBF1 binding sites. Thus, EBF1 interacts with specific binding sites located both in the promoter and in the enhancer regions of the SLAMF1 gene and is critical for its expression in human B cells.


Subject(s)
Gene Expression Regulation , Signaling Lymphocytic Activation Molecule Family Member 1/genetics , Trans-Activators/genetics , Transcription, Genetic , B-Lymphocytes/cytology , B-Lymphocytes/metabolism , Basic Helix-Loop-Helix Transcription Factors/genetics , Basic Helix-Loop-Helix Transcription Factors/metabolism , Binding Sites , Cell Line, Tumor , Enhancer Elements, Genetic , Genes, Reporter , HEK293 Cells , Humans , Interferon Regulatory Factors/genetics , Interferon Regulatory Factors/metabolism , Luciferases/genetics , Mutation , NF-kappa B/genetics , NF-kappa B/metabolism , Nuclear Proteins/genetics , Nuclear Proteins/metabolism , Primary Cell Culture , Promoter Regions, Genetic , Protein Binding , Proto-Oncogene Proteins/genetics , Proto-Oncogene Proteins/metabolism , STAT6 Transcription Factor/genetics , STAT6 Transcription Factor/metabolism , Signal Transduction , Signaling Lymphocytic Activation Molecule Family Member 1/metabolism , Sp1 Transcription Factor/genetics , Sp1 Transcription Factor/metabolism , Trans-Activators/metabolism , Transcription Factors/genetics , Transcription Factors/metabolism
15.
BMC Genomics ; 17 Suppl 2: 395, 2016 Jun 23.
Article in English | MEDLINE | ID: mdl-27356864

ABSTRACT

BACKGROUND: Somatic mutations in cancer cells affect various genomic elements disrupting important cell functions. In particular, mutations in DNA binding sites recognized by transcription factors can alter regulator binding affinities and, consequently, expression of target genes. A number of promoter mutations have been linked with an increased risk of cancer. Cancer somatic mutations in binding sites of selected transcription factors have been found under positive selection. However, the action and significance of negative selection in non-coding regions remain controversial. RESULTS: Here we present an analysis of transcription factor binding motifs co-localized with non-coding variants. To avoid statistical bias we account for mutation signatures of different cancer types. For many transcription factors, including multiple members of the FOX, HOX, and NR families, we show that human cancers accumulate fewer mutations that increase or decrease the affinity of predicted binding sites than expected by chance. Such stability of binding motifs is even more pronounced in DNase accessible regions. CONCLUSIONS: Our data demonstrate negative selection against binding site alterations and suggest that such selection pressure protects cancer cells from rewiring of regulatory circuits. Further analysis of transcription factors with conserved binding motifs can reveal cell regulatory pathways crucial for the survivability of various human cancers.


Subject(s)
DNA/metabolism , Mutation , Neoplasms/genetics , Transcription Factors/metabolism , Binding Sites , DNA/chemistry , DNA/genetics , Humans , Neoplasms/metabolism , Promoter Regions, Genetic , Protein Binding , Selection, Genetic , Transcription Factors/chemistry
16.
Nucleic Acids Res ; 44(D1): D116-25, 2016 Jan 04.
Article in English | MEDLINE | ID: mdl-26586801

ABSTRACT

Models of transcription factor (TF) binding sites provide a basis for a wide spectrum of studies in regulatory genomics, from reconstruction of regulatory networks to functional annotation of transcripts and sequence variants. While TFs may recognize different sequence patterns in different conditions, it is pragmatic to have a single generic model for each particular TF as a baseline for practical applications. Here we present the expanded and enhanced version of HOCOMOCO (http://hocomoco.autosome.ru and http://www.cbrc.kaust.edu.sa/hocomoco10), the collection of models of DNA patterns recognized by transcription factors. HOCOMOCO now provides position weight matrix (PWM) models for binding sites of 601 human TFs and, in addition, PWMs for 396 mouse TFs. Furthermore, we introduce the largest collection to date of dinucleotide PWM models for 86 (52) human (mouse) TFs. The update is based on the analysis of massive ChIP-Seq and HT-SELEX datasets, with the validation of the resulting models on in vivo data. To facilitate practical application, all HOCOMOCO models are linked to gene and protein databases (Entrez Gene, HGNC, UniProt) and accompanied by precomputed score thresholds. Finally, we provide command-line tools for PWM and diPWM threshold estimation and motif finding in nucleotide sequences.


Subject(s)
Databases, Genetic , Regulatory Elements, Transcriptional , Transcription Factors/metabolism , Animals , Binding Sites , Chromatin Immunoprecipitation , Humans , Mice , Models, Biological , Sequence Analysis, DNA
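
The combination of a PWM and a score threshold described in the abstract above is what turns a motif into a binding-site predictor. A minimal Python sketch of motif finding with such a model follows, assuming a mononucleotide PWM stored as a list of rows with one score per letter in A, C, G, T order. The function names and the toy matrix are hypothetical; they are not the HOCOMOCO command-line tools or their file format.

```python
ALPHABET = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def score_word(pwm, word):
    """Sum the per-position PWM scores for a word of the same width."""
    return sum(row[ALPHABET.index(letter)] for row, letter in zip(pwm, word))


def find_hits(pwm, threshold, sequence):
    """Report (start, strand, score) for every window scoring >= threshold.

    Both strands are checked; the start is always given on the forward strand.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        reverse_complement = window.translate(COMPLEMENT)[::-1]
        for strand, word in (("+", window), ("-", reverse_complement)):
            score = score_word(pwm, word)
            if score >= threshold:
                hits.append((start, strand, score))
    return hits


# Hypothetical two-column matrix that favours the word "AC".
toy_pwm = [[1, -1, -1, -1], [-1, 1, -1, -1]]
hits = find_hits(toy_pwm, 2, "GACGT")
```

With the toy matrix and a threshold of 2, the sequence `GACGT` yields a forward-strand hit at position 1 (`AC`) and a reverse-strand hit at position 3 (`GT`, whose reverse complement is `AC`). Choosing the threshold from a target P-value, as the precomputed thresholds in the collection do, is outside this sketch.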
17.
Database (Oxford) ; 2015: bav067, 2015.
Article in English | MEDLINE | ID: mdl-26153137

ABSTRACT

Epigenetics refers to stable and long-term alterations of cellular traits that are not caused by changes in the DNA sequence per se. Rather, covalent modifications of DNA and histones affect gene expression and genome stability via proteins that recognize and act upon such modifications. Many enzymes that catalyse epigenetic modifications or are critical for enzymatic complexes have been discovered, and this is encouraging investigators to study the role of these proteins in diverse normal and pathological processes. Rapidly growing knowledge in the area has resulted in the need for a resource that compiles, organizes and presents curated information to researchers in an easily accessible and user-friendly form. Here we present EpiFactors, a manually curated database providing information about epigenetic regulators, their complexes, targets and products. EpiFactors contains information on 815 proteins, including 95 histones and protamines. For 789 of these genes, we include expression values across several samples, in particular a collection of 458 human primary cell samples (for approximately 200 cell types, in many cases from three individual donors), covering most mammalian cell steady states, 255 different cancer cell lines (representing approximately 150 cancer subtypes) and 134 human postmortem tissues. Expression values were obtained by the FANTOM5 consortium using the Cap Analysis of Gene Expression technique. EpiFactors also contains information on 69 protein complexes that are involved in epigenetic regulation. The resource is practical for a wide range of users, including biologists, pharmacologists and clinicians.


Subject(s)
Databases, Genetic , Epigenesis, Genetic , Genomic Instability , Histones , Neoplasm Proteins , Neoplasms , Protamines , Epigenomics , Histones/biosynthesis , Histones/genetics , Humans , Neoplasm Proteins/biosynthesis , Neoplasm Proteins/genetics , Neoplasms/genetics , Neoplasms/metabolism , Protamines/genetics , Protamines/metabolism
18.
Algorithms Mol Biol ; 8(1): 23, 2013 Sep 30.
Article in English | MEDLINE | ID: mdl-24074225

ABSTRACT

BACKGROUND: The positional weight matrix (PWM) remains the most popular model for quantification of transcription factor (TF) binding. A PWM supplied with a score threshold defines a set of putative transcription factor binding sites (TFBS), thus providing a TFBS model. TF binding DNA fragments obtained by different experimental methods usually give similar but not identical PWMs. This is also common for different TFs from the same structural family. Thus it is often necessary to measure the similarity between PWMs. The popular tools compare PWMs directly using matrix elements. Yet, for log-odds PWMs, negative elements do not contribute to the scores of highly scoring TFBS and thus may be different without affecting the sets of the best recognized binding sites. Moreover, the two TFBS sets recognized by a given pair of PWMs can be more or less different depending on the score thresholds. RESULTS: We propose a practical approach for comparing two TFBS models, each consisting of a PWM and the respective scoring threshold. The proposed measure is a variant of the Jaccard index between two TFBS sets. The measure defines a metric space for TFBS models of all finite lengths. The algorithm can compare TFBS models constructed using substantially different approaches, like PWMs with raw positional counts and log-odds. We present an efficient software implementation: MACRO-APE (MAtrix CompaRisOn by Approximate P-value Estimation). CONCLUSIONS: MACRO-APE can be effectively used to compute the Jaccard index based similarity for two TFBS models. A two-pass scanning algorithm is presented to scan a given collection of PWMs for PWMs similar to a given query. AVAILABILITY AND IMPLEMENTATION: MACRO-APE is implemented in Ruby 1.9; software including source code and a manual is freely available at http://autosome.ru/macroape/ and in supplementary materials.
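
The idea of comparing two TFBS models through the sets of words they recognize can be illustrated with a toy Python sketch. It assumes two PWMs of equal width that are already aligned, and it enumerates every word exhaustively, which is feasible only for short motifs. MACRO-APE itself relies on approximate P-value estimation rather than enumeration, so the code below shows the underlying Jaccard index only and is not the published algorithm. All names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"


def score(pwm, word):
    """Sum the per-position PWM scores for a word of the same width."""
    return sum(row[ALPHABET.index(letter)] for row, letter in zip(pwm, word))


def recognized_set(pwm, threshold):
    """All words of the PWM's width that score at or above the threshold."""
    words = ("".join(letters) for letters in product(ALPHABET, repeat=len(pwm)))
    return {word for word in words if score(pwm, word) >= threshold}


def jaccard(model_a, model_b):
    """Jaccard index of the word sets of two (pwm, threshold) models."""
    set_a = recognized_set(*model_a)
    set_b = recognized_set(*model_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical toy models: A accepts only "AC"; B accepts "AC" and "AG".
model_a = ([[1, 0, 0, 0], [0, 1, 0, 0]], 2)
model_b = ([[1, 0, 0, 0], [0, 1, 1, 0]], 2)
similarity = jaccard(model_a, model_b)
```

For the toy models the intersection holds one word and the union two, so the similarity is 0.5. Raising or lowering either threshold changes the recognized sets and therefore the similarity, which is the dependence on thresholds that the abstract points out.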

19.
Nucleic Acids Res ; 41(Database issue): D195-202, 2013 Jan.
Article in English | MEDLINE | ID: mdl-23175603

ABSTRACT

Transcription factor (TF) binding site (TFBS) models are crucial for computational reconstruction of transcription regulatory networks. In existing repositories, a TF often has several models (also called binding profiles or motifs), obtained from different experimental data. Having a single TFBS model for a TF is more pragmatic for practical applications. We show that integration of TFBS data from various types of experiments into a single model typically results in improved model quality, probably due to partial correction of source-specific technique bias. We present the Homo sapiens comprehensive model collection (HOCOMOCO, http://autosome.ru/HOCOMOCO/, http://cbrc.kaust.edu.sa/hocomoco/) containing carefully hand-curated TFBS models constructed by integration of binding sequences obtained by both low- and high-throughput methods. To construct position weight matrices to represent these TFBS models, we used ChIPMunk software in four computational modes, including a newly developed periodic positional prior mode associated with DNA helix pitch. We selected only one TFBS model per TF, unless there was clear experimental evidence for two rather distinct TFBS models. We assigned a quality rating to each model. HOCOMOCO contains 426 systematically curated TFBS models for 401 human TFs, where 172 models are based on more than one data source.


Subject(s)
Databases, Genetic , Regulatory Elements, Transcriptional , Transcription Factors/metabolism , Binding Sites , Humans , Internet , Models, Genetic , Position-Specific Scoring Matrices